For this machine learning lab, I am using the species Canis latrans to predict presence from observations and environmental data. There are five parts to this lab:
Canis Latrans in grassland habitat. Source Gerald and Buff Corsi
obs_csv <- file.path(dir_data, "obs.csv")
obs_geo <- file.path(dir_data, "obs.geojson")
redo <- FALSE
if (!file.exists(obs_geo) | redo){
# get species occurrence data from GBIF with coordinates
(res <- spocc::occ(
query = 'Canis latrans',
from = 'gbif',
has_coords = T,
limit = 10000))
# extract data frame from result
df <- res$gbif$data[[1]]
readr::write_csv(df, obs_csv)
# convert to points of observation from lon/lat columns in data frame
obs <- df %>%
sf::st_as_sf(
coords = c("longitude", "latitude"),
crs = st_crs(4326)) %>%
select(prov, key) # save space (joinable from obs_csv)
sf::write_sf(obs, obs_geo, delete_dsn=T)
}
obs <- sf::read_sf(obs_geo)
nrow(obs) # number of rows
## [1] 10000
# show points on map
mapview::mapview(obs, map.types = "Esri.WorldImagery")
dir_env <- file.path(dir_data, "env")
# set a default data directory
options(sdmpredictors_datadir = dir_env)
# choosing terrestrial
env_datasets <- sdmpredictors::list_datasets(terrestrial = TRUE, marine = FALSE)
# show table of datasets
env_datasets %>%
select(dataset_code, description, citation) %>%
DT::datatable()
# choose datasets for a vector
env_datasets_vec <- c("WorldClim", "ENVIREM")
# get layers
env_layers <- sdmpredictors::list_layers(env_datasets_vec)
DT::datatable(env_layers)
# choose layers after some inspection and perhaps consulting literature
env_layers_vec <- c("WC_alt", "WC_bio1", "WC_bio12", "ER_tri", "ER_topoWet")
# get layers
env_stack <- load_layers(env_layers_vec)
# interactive plot layers, hiding all but first (select others)
# mapview(env_stack, hide = T) # makes the html too big for Github
plot(env_stack, nc=2)
How many observations total are in GBIF for your species?
Answer: There are 37,929 total observations of Canis latrans in the GBIF database.
Did you have to perform any data cleaning steps?
Answer: I did not have to perform any data cleaning steps. At first visually, I do not see any odd observations. All observations were within the North American continent, which is the appropriate range for Canis latrans. However, when I zoomed into water regions such as the Great Lakes or the Gulf of St. Lawrence by Newfoundland - I saw some observations were in the water near the shore. At first I thought this was odd, but further research showed that Canis latrans are excellent swimmers and often swim about 0.5 miles off shore, and up to 7 or 8 miles in total. (Fun fact: Canis latrans have webbed feet!) All water observations appeared to be within 0.5 miles off shore and so I did not consider them odd and left them as is.
What environmental layers did you choose as predictors? Can you find any support for these in the literature?
Answer: I chose WC_alt, WC_bio1, WC_bio12, ER_tri, and ER_topoWet from the WorldClim and ENVIREM data sets. I found the WorldClim and ENVIREM data sets using sdmpredictors list_datasets based on the criteria of terrestrial and excluded data sets related to marine environments.
Canis latrans can be found in many different types of habitats such as forests, prairie, mountains, and even agriculture or urban areas. I chose predictors like altitude, annual mean temperature, annual precipitation, terrain roughness index, and topographic wetness to accommodate for the wide variety of habitats that Canis latrans can be found in (1, 2, 3).
obs_hull_geo <- file.path(dir_data, "obs_hull.geojson")
env_stack_grd <- file.path(dir_data, "env_stack.grd")
if (!file.exists(obs_hull_geo) | redo){
# make convex hull around points of observation
obs_hull <- sf::st_convex_hull(st_union(obs))
# save obs hull
write_sf(obs_hull, obs_hull_geo)
}
obs_hull <- read_sf(obs_hull_geo)
# show points on map
mapview(
list(obs, obs_hull))
Applied convex hull to chosen environmental layers.
if (!file.exists(env_stack_grd) | redo){
obs_hull_sp <- sf::as_Spatial(obs_hull)
env_stack <- raster::mask(env_stack, obs_hull_sp) %>%
raster::crop(extent(obs_hull_sp))
writeRaster(env_stack, env_stack_grd, overwrite=T)
}
env_stack <- stack(env_stack_grd)
# show map
# mapview(obs) +
# mapview(env_stack, hide = T) # makes html too big for Github
plot(env_stack, nc=2)
Based on convex hull, create a mask of region of interest and then generate random points inside mask.
absence_geo <- file.path(dir_data, "absence.geojson")
pts_geo <- file.path(dir_data, "pts.geojson")
pts_env_csv <- file.path(dir_data, "pts_env.csv")
if (!file.exists(absence_geo) | redo){
# get raster count of observations
r_obs <- rasterize(
sf::as_Spatial(obs), env_stack[[1]], field=1, fun='count')
# show map
# mapview(obs) +
# mapview(r_obs)
# create mask for
r_mask <- mask(env_stack[[1]] > -Inf, r_obs, inverse=T)
# generate random points inside mask
absence <- dismo::randomPoints(r_mask, nrow(obs)) %>%
as_tibble() %>%
st_as_sf(coords = c("x", "y"), crs = 4326)
write_sf(absence, absence_geo, delete_dsn=T)
}
absence <- read_sf(absence_geo)
# show map of presence, ie obs, and absence
mapview(obs, col.regions = "green") +
mapview(absence, col.regions = "gray")
Create a data table that will feed into the SDM where:
y is the present column with values of 1 (present) or 0 (absent).x is all other columns.if (!file.exists(pts_env_csv) | redo){
# combine presence and absence into single set of labeled points
pts <- rbind(
obs %>%
mutate(
present = 1) %>%
select(present, key),
absence %>%
mutate(
present = 0,
key = NA)) %>%
mutate(
ID = 1:n()) %>%
relocate(ID)
write_sf(pts, pts_geo, delete_dsn=T)
# extract raster values for points
pts_env <- raster::extract(env_stack, as_Spatial(pts), df=TRUE) %>%
tibble() %>%
# join present and geometry columns to raster value results for points
left_join(
pts %>%
select(ID, present),
by = "ID") %>%
relocate(present, .after = ID) %>%
# extract lon, lat as single columns
mutate(
#present = factor(present),
lon = st_coordinates(geometry)[,1],
lat = st_coordinates(geometry)[,2]) %>%
select(-geometry)
write_csv(pts_env, pts_env_csv)
}
pts_env <- read_csv(pts_env_csv)
pts_env %>%
# show first 10 presence, last 10 absence
slice(c(1:10, (nrow(pts_env)-9):nrow(pts_env))) %>%
DT::datatable(
rownames = F,
options = list(
dom = "t",
pageLength = 20))
pts_env %>%
select(-ID) %>%
mutate(
present = factor(present)) %>%
pivot_longer(-present) %>%
ggplot() +
geom_density(aes(x = value, fill = present)) +
scale_fill_manual(values = alpha(c("gray", "green"), 0.5)) +
scale_x_continuous(expand=c(0,0)) +
scale_y_continuous(expand=c(0,0)) +
theme_bw() +
facet_wrap(~name, scales = "free") +
theme(
legend.position = c(1, 0),
legend.justification = c(1, 0))
GGallyThe pairs plots shows correlations between variables. Based on these plots I would consider dropping ER_topoWet and ER_tri because they are highly correlated with WC_bio1.
GGally::ggpairs(
select(pts_env, -ID),
aes(color = factor(present)))
# setup model data
d <- pts_env %>%
select(-ID) %>% # remove terms we don't want to model
tidyr::drop_na() # drop the rows with NA values
nrow(d)
## [1] 19931
# fit a linear model
mdl <- lm(present ~ ., data = d)
summary(mdl)
##
## Call:
## lm(formula = present ~ ., data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14695 -0.41599 0.02142 0.43100 1.53519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.412e+00 8.532e-02 16.548 < 2e-16 ***
## WC_alt -1.851e-04 9.285e-06 -19.931 < 2e-16 ***
## WC_bio1 2.394e-02 1.870e-03 12.806 < 2e-16 ***
## WC_bio12 -2.121e-04 9.299e-06 -22.804 < 2e-16 ***
## ER_tri -2.098e-03 1.852e-04 -11.331 < 2e-16 ***
## ER_topoWet -9.068e-02 3.881e-03 -23.364 < 2e-16 ***
## lon 7.571e-04 2.971e-04 2.548 0.0108 *
## lat 6.830e-03 1.420e-03 4.809 1.53e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4623 on 19923 degrees of freedom
## Multiple R-squared: 0.1456, Adjusted R-squared: 0.1453
## F-statistic: 485.1 on 7 and 19923 DF, p-value: < 2.2e-16
y_predict <- predict(mdl, d, type="response")
y_true <- d$present
range(y_predict)
## [1] -0.5351872 1.1469468
range(y_true)
## [1] 0 1
# fit a generalized linear model with a binomial logit link function
mdl <- glm(present ~ ., family = binomial(link="logit"), data = d)
summary(mdl)
##
## Call:
## glm(formula = present ~ ., family = binomial(link = "logit"),
## data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5407 -1.0206 0.2981 1.0386 3.2494
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.289e+00 4.236e-01 10.125 < 2e-16 ***
## WC_alt -8.787e-04 4.576e-05 -19.202 < 2e-16 ***
## WC_bio1 1.207e-01 9.403e-03 12.835 < 2e-16 ***
## WC_bio12 -1.083e-03 4.904e-05 -22.091 < 2e-16 ***
## ER_tri -1.089e-02 9.704e-04 -11.220 < 2e-16 ***
## ER_topoWet -4.387e-01 1.917e-02 -22.890 < 2e-16 ***
## lon 5.145e-03 1.422e-03 3.617 0.000298 ***
## lat 3.995e-02 7.150e-03 5.587 2.31e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27630 on 19930 degrees of freedom
## Residual deviance: 24454 on 19923 degrees of freedom
## AIC: 24470
##
## Number of Fisher Scoring iterations: 4
y_predict <- predict(mdl, d, type="response")
range(y_predict)
## [1] 0.005095279 0.960343222
# show term plots
termplot(mdl, partial.resid = TRUE, se = TRUE, main = F)
librarian::shelf(mgcv)
# fit a generalized additive model with smooth predictors
mdl <- mgcv::gam(
formula = present ~ s(WC_alt) + s(WC_bio1) +
s(WC_bio12) + s(ER_tri) + s(ER_topoWet) + s(lon) + s(lat),
family = binomial, data = d)
summary(mdl)
##
## Family: binomial
## Link function: logit
##
## Formula:
## present ~ s(WC_alt) + s(WC_bio1) + s(WC_bio12) + s(ER_tri) +
## s(ER_topoWet) + s(lon) + s(lat)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.13417 0.03173 -4.228 2.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(WC_alt) 7.702 8.534 736.98 < 2e-16 ***
## s(WC_bio1) 7.201 7.808 601.92 < 2e-16 ***
## s(WC_bio12) 6.432 7.054 826.53 < 2e-16 ***
## s(ER_tri) 8.714 8.929 60.00 < 2e-16 ***
## s(ER_topoWet) 8.823 8.986 38.83 7.13e-06 ***
## s(lon) 8.525 8.931 572.01 < 2e-16 ***
## s(lat) 8.844 8.990 269.12 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.402 Deviance explained = 34.9%
## UBRE = -0.091154 Scale est. = 1 n = 19931
# show term plots
plot(mdl, scale=0)
Which GAM environmental variables, and even range of values, seem to contribute most towards presence (above 0 response) versus absence (below 0 response)?
Answer: The GAM shows that all environmental variables are statistically significant and contribute a lot to presence. Latitude and ER_topoWet appear to contribute the most to presence.
# load extra packages
librarian::shelf(
maptools, sf)
# show version of maxent
maxent()
## This is MaxEnt version 3.4.3
# get environmental rasters
# NOTE: the first part of Lab 1. SDM - Explore got updated to write this clipped environmental raster stack
env_stack_grd <- file.path(dir_data, "env_stack.grd")
env_stack <- stack(env_stack_grd)
plot(env_stack, nc=2)
# get presence-only observation points (maxent extracts raster values for you)
obs_geo <- file.path(dir_data, "obs.geojson")
obs_sp <- read_sf(obs_geo) %>%
sf::as_Spatial() # maxent prefers sp::SpatialPoints over newer sf::sf class
# fit a maximum entropy model
mdl <- maxent(env_stack, obs_sp)
## This is MaxEnt version 3.4.3
# plot variable contributions per predictor
plot(mdl)
# plot term plots
response(mdl)
# predict
y_predict <- predict(env_stack, mdl) #, ext=ext, progress='')
plot(y_predict, main='Maxent, raw prediction')
data(wrld_simpl, package="maptools")
plot(wrld_simpl, add=TRUE, border='dark grey')
Which Maxent environmental variables, and even range of values, seem to contribute most towards presence (closer to 1 response) and how might this differ from the GAM results?
Answer: WC_alt, WC_bio1, and WC_bio12 seem to contribute the most toward presence, while ER_tri and ER_topoWet seem to contribute the least. Important to note that WC_bio1 contributes far more than any other variables (greater than 60%). This differs significantly from the GAM results because….
Chamberlain, Michael J., et al. “Spatial-Use Patterns, Movements, and Interactions among Adult Coyotes in Central Mississippi.” Canadian Journal of Zoology, vol. 78, no. 12, Dec. 2000, pp. 2087–95. DOI.org (Crossref), https://doi.org/10.1139/z00-154.
Gese, Eric M., et al. “Home Range and Habitat Use of Coyotes in Southeastern Colorado.” The Journal of Wildlife Management, vol. 52, no. 4, Oct. 1988, p. 640. DOI.org (Crossref), https://doi.org/10.2307/3800923.
Hinton, Joseph W., et al. “Space Use and Habitat Selection by Resident and Transient Coyotes (Canis Latrans).” PLOS ONE, edited by Marco Apollonio, vol. 10, no. 7, July 2015, p. e0132203. DOI.org (Crossref), https://doi.org/10.1371/journal.pone.0132203.